Categorisation by Context


Abstract

Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material, and therefore it will be necessary to resort to techniques for automatic classification of documents. Classification is traditionally performed by extracting information for indexing a document from the document itself. The paper describes the technique of categorisation by context, which exploits the context perceivable from the structure of HTML documents to extract useful information for classifying the documents they refer to. We present the results of experiments with a preliminary implementation of the technique.

… [Excite]) perform search based on the content of documents and provide results as a linear list of such documents, typically ranked in order of relevance. The often unsatisfactory aspect of this approach is that the list can be quite long, with many replications, and without any indication of possible grouping of related material. For instance, issuing a query with the keyword "garbage", one would obtain a list of documents that discuss ecological issues interspersed with documents about garbage collection in programming languages. Splitting the list of retrieved documents into thematic categories would significantly facilitate selecting those documents of more interest to the user.

Notable exceptions to this approach are Lycos™ [Lycos] and Yahoo™ [Yahoo], which maintain a categorisation of part of their search material. In fact, Yahoo gave up its general search service in favour of Altavista [Altavista] and supports only searches within its own catalogue. This allows a more focused search restricted to the documents within a given category, and the results of a query are presented arranged within subcategories.

However, both Lycos and Yahoo are based on manual categorisation of documents performed by a small set of well-trained categorisation technicians (even though Lycos™ recently announced the development of an automatic classifier). It is questionable whether manual classification will be able to scale well with the growth of the Web, which will reportedly reach over 30 terabytes within two years, a size larger than the whole US Library of Congress. First, manual classification is slow and expensive, since it relies on skilled manpower. Second, the consistency of categorisation is hard to maintain when different human classifiers are involved. Categorisation is quite a subjective task, as other content-related tasks …
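
As a rough illustration of what "context perceivable from the structure of HTML documents" might mean in practice, the sketch below collects, for every link in a page, its anchor text and the nearest preceding heading. The choice of these two signals, and the names `LinkContextParser` and `extract_link_contexts`, are assumptions made for this example only; the abstract does not say which structural elements the authors actually use.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect, for each <a href>, its anchor text and the nearest preceding heading.

    Hypothetical helper: which context elements to use is an assumption.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.links = []           # one dict per link: url, anchor, heading
        self._heading = ""        # text of the most recent heading seen
        self._in_heading = False
        self._current = None      # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading = ""
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {
                    "url": href,
                    "anchor": "",
                    "heading": self._heading.strip(),
                }

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
        elif tag == "a" and self._current is not None:
            self._current["anchor"] = self._current["anchor"].strip()
            self.links.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        if self._current is not None:
            self._current["anchor"] += data


def extract_link_contexts(html):
    """Return a list of {url, anchor, heading} dicts for the links in `html`."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

The words gathered this way describe the target of each link without fetching it, so a classifier could, for instance, tell a link filed under a "Memory management" heading apart from one filed under "Recycling", even when both anchors mention "garbage".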

Similar resources

Exemplar-based pitch accent categorisation using the generalized context model

This paper presents the results of a pitch accent categorisation simulation which attempts to classify L*H and H*L accents using a psychologically motivated exemplar-theoretic model of categorisation. Pitch accents are represented in terms of six linguistically meaningful parameters describing their shape. No additional information is employed in the categorisation process. The results indicate...


Interplay between semantic and emotional information in visual scene processing

We examined whether and how an image's semantic and emotional content interact during visual processing. In each trial, we briefly presented two emotional or neutral images (a scene context and an object), manipulating the semantic consistency and the emotional consistency of the pair. Participants categorised one image semantically or emotionally. Semantic categorisation was overall better than em...
